Poster popularity by country

This notebook loads data on poster viewership at the SfN 2016 annual meeting, organized by the country affiliated with each poster.

We find that the variation in poster popularity across countries is not significant compared to what is expected by chance.

Import libraries and load data


In [1]:
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('white')

import pandas as pd

In [2]:
# Load data
df = pd.DataFrame.from_csv('./posterviewers_by_country.csv')
key_N = 'Number of people'

1. Summarize data by country


In [3]:
# 0. Count the number of posters from each country
# Calculate mean poster popularity
states = df['Country'].unique()
dict_state_counts = {'Country':states,'count':np.zeros(len(states),dtype=int),'popularity':np.zeros(len(states))}
for i, s in enumerate(states):
    dict_state_counts['count'][i] = int(sum(df['Country']==s))
    dict_state_counts['popularity'][i] = np.round(np.mean(df[df['Country']==s][key_N]),3)
df_counts = pd.DataFrame.from_dict(dict_state_counts)
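
For reference, the same per-country counts and mean popularity could also be computed with a pandas groupby aggregation. This is a sketch for comparison only (it assumes the df and key_N defined above), not the cell used for the results below; note that groupby returns countries in sorted order rather than in order of appearance.

# Sketch: per-country poster count and mean popularity via groupby
df_counts_alt = (df.groupby('Country')[key_N]
                   .agg(['count', 'mean'])
                   .rename(columns={'mean': 'popularity'})
                   .reset_index())
df_counts_alt['popularity'] = df_counts_alt['popularity'].round(3)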

In [4]:
# Visualize dataframe
# count = total number of posters affiliated with that country
# popularity = average number of viewers at a poster affiliated with that country
df_counts.head()


Out[4]:
Country count popularity
0 United States 1540 2.445
1 China 71 2.423
2 Japan 136 2.066
3 Bolivarian Republic of 2 1.500
4 Singapore 8 1.750

2. Poster popularity vs. prevalence

Across states in the United States, we previously found a positive correlation between the number of posters from a state and the popularity of those posters. Across countries, we arguably see this again, but only at a trend level of significance (one-tailed p-value = 0.06).


In [5]:
print sp.stats.spearmanr(np.log10(df_counts['count']),df_counts['popularity'])
plt.figure(figsize=(3,3))
plt.semilogx(df_counts['count'],df_counts['popularity'],'k.')
plt.xlabel('Number of posters\nfrom the country')
plt.ylabel('Average number of viewers per poster')
plt.ylim((-.1,3.6))
plt.xlim((.9,1000))


SpearmanrResult(correlation=0.18329523845178372, pvalue=0.12060771606868086)
Out[5]:
(0.9, 1000)
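
The one-tailed p-value quoted above can be obtained by halving the two-tailed Spearman p-value when the observed correlation is in the hypothesized (positive) direction. A minimal sketch, assuming the symmetric null distribution underlying scipy's two-sided test:

# One-tailed p-value for a positive Spearman correlation (sketch)
rho, p_two = sp.stats.spearmanr(np.log10(df_counts['count']), df_counts['popularity'])
p_one = p_two / 2 if rho > 0 else 1 - p_two / 2
print p_one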

3. Permutation tests: difference in popularity across countries

In this code, we test whether the relative popularity (or unpopularity) observed for any country falls outside what is expected by chance.

Here, the most and least popular countries are identified with a nonparametric statistical test comparing the number of viewers at posters from that country to the number of viewers at posters from all other countries.


In [6]:
# Simulate randomized data
Nperm = 100
N_posters = len(df)
rand_statepop = np.zeros((Nperm,len(states)),dtype=np.ndarray)
rand_statepopmean = np.zeros((Nperm,len(states)))
for i in range(Nperm):
    # Random permutation of posters, organized by state
    randperm_viewers = np.random.permutation(df[key_N].values)
    for j, s in enumerate(states):
        rand_statepop[i,j] = randperm_viewers[np.where(df['Country']==s)[0]]
        rand_statepopmean[i,j] = np.mean(randperm_viewers[np.where(df['Country']==s)[0]])
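
As a quick sanity check on the permutation scheme (a sketch, not in the original notebook), the pooled viewer counts are unchanged by each permutation, since only the assignment of viewer counts to countries is shuffled:

# Sketch: the pooled mean is identical before and after permutation
print np.mean(df[key_N])
print np.mean(np.hstack(rand_statepop[0]))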

In [7]:
# True data: Calculate all p-values for the difference between one country's popularity and the rest
min_N_posters = 10
states_big = states[np.where(df_counts['count']>=min_N_posters)[0]]
N_big = len(states_big)
t_true_all = np.zeros(N_big)
p_true_all = np.zeros(N_big)
for i, state in enumerate(states_big):
    t_true_all[i], _ = sp.stats.ttest_ind(df[df['Country']==state][key_N],df[df['Country']!=state][key_N])
    _, p_true_all[i] = sp.stats.mannwhitneyu(df[df['Country']==state][key_N],df[df['Country']!=state][key_N])
    
pmin_pop = np.min(p_true_all[np.where(t_true_all>0)[0]])
pmin_unpop = np.min(p_true_all[np.where(t_true_all<0)[0]])

print 'Most popular country: ', states_big[np.argmax(t_true_all)], '. p=', str(pmin_pop)
print 'Least popular country: ', states_big[np.argmin(t_true_all)], '. p=', str(pmin_unpop)


Most popular country:  Canada . p= 0.0606850998232
Least popular country:  United States . p= 0.0344062544646
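
Note that the country printed above is the one with the most extreme t-statistic, while the p-value printed is the smallest Mann-Whitney p among countries on that side of zero. A sketch (not part of the original analysis) that pairs each country with its own p-value explicitly:

# Sketch: pick the country whose Mann-Whitney p is smallest on each side of t=0
i_pop = np.argmin(np.where(t_true_all > 0, p_true_all, np.inf))
i_unpop = np.argmin(np.where(t_true_all < 0, p_true_all, np.inf))
print 'Most popular country: ', states_big[i_pop], '. p=', p_true_all[i_pop]
print 'Least popular country: ', states_big[i_unpop], '. p=', p_true_all[i_unpop]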

In [8]:
# Calculate minimum p-values for each permutation

# Calculate all p and t values
t_rand_all = np.zeros((Nperm,N_big))
p_rand_all = np.zeros((Nperm,N_big))
pmin_pop_rand = np.zeros(Nperm)
pmin_unpop_rand = np.zeros(Nperm)
for i in range(Nperm):
    for j, state in enumerate(states_big):
        idx_use = range(len(states_big))
        idx_use.pop(j)
        t_rand_all[i,j], _ = sp.stats.ttest_ind(rand_statepop[i,j],np.hstack(rand_statepop[i,idx_use]))
        _, p_rand_all[i,j] = sp.stats.mannwhitneyu(rand_statepop[i,j],np.hstack(rand_statepop[i,idx_use]))
        
    # Identify the greatest significance of a country being more popular than the rest
    pmin_pop_rand[i] = np.min(p_rand_all[i][np.where(t_rand_all[i]>0)[0]])
    # Identify the greatest significance of a country being less popular than the rest
    pmin_unpop_rand[i] = np.min(p_rand_all[i][np.where(t_rand_all[i]<0)[0]])


C:\Users\Scott\Anaconda2\lib\site-packages\numpy\core\_methods.py:82: RuntimeWarning: Degrees of freedom <= 0 for slice
  warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)

In [9]:
# Test if most popular and least popular countries are outside of expectation
print 'Chance of a country being more distinctly popular than Canada: '
print sum(i < pmin_pop for i in pmin_pop_rand) / float(len(pmin_pop_rand))
print 'Chance of a country being less distinctly popular than the US: '
print sum(i < pmin_unpop for i in pmin_unpop_rand) / float(len(pmin_unpop_rand))


Chance of a country being more distinctly popular than Canada: 
0.7
Chance of a country being less distinctly popular than the US: 
0.61
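
These fractions are the empirical p-values of the observed minimum p-values against the permutation distribution: well over half of the permutations produce a more extreme country by chance alone, so neither Canada's popularity nor the United States' unpopularity is beyond what chance predicts. An equivalent, more compact way to compute the same fractions (a sketch using the arrays defined above):

# Equivalent empirical p-values via boolean array means (sketch)
print np.mean(pmin_pop_rand < pmin_pop)
print np.mean(pmin_unpop_rand < pmin_unpop)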

In [ ]: